library(lubridate)
library(dplyr)
library(tidyr)
library(xts)
library(tseries)
library(forecast)
library(zoo)
options(scipen=999)
In this notebook, we want to analyse whether Trump’s tweets have a statistically significant effect on his approval ratings, and if they do, can we use the popularity of his tweets to predict his approval ratings?
First, import csv data from Kaggle (stored on local machine).
tweets <- read.csv('realdonaldtrump.csv', header=TRUE)
approval <- read.csv('datasets_683990_1200219_trump-approval-ratings_approval_topline.csv', header=TRUE)
head(tweets)
head(approval)
Now, pull out favourites, approval %age estimate & disapproval %age estimate.
tweets_date <- format(as.Date(tweets$date), "%m/%d/%Y")
approval_date <- strftime(as.Date(approval$modeldate, format="%m/%d/%Y"), format="%m/%d/%Y")
approval_df <- data.frame(approval_date, approval$approve_estimate)
disapproval_df <- data.frame(approval_date, approval$disapprove_estimate)
tweets_df <- data.frame(tweets_date, tweets$favorites)
head(disapproval_df)
tail(tweets_df)
Now we need to match these up by date. First, we need to take average of favourites & (dis)approval per day - use dplyr to group by date and take average.
tweets_fav <- tweets_df %>%
group_by(tweets_date) %>%
mutate(mean_favorites = mean(tweets.favorites)) %>%
select(-tweets.favorites) %>%
distinct()
approval_means <- approval_df %>%
group_by(approval_date) %>%
mutate(mean_approval = mean(approval.approve_estimate)) %>%
select(-approval.approve_estimate) %>%
distinct()
disapproval_means <- disapproval_df %>%
group_by(approval_date) %>%
mutate(mean_disapproval = mean(approval.disapprove_estimate)) %>%
select(-approval.disapprove_estimate) %>%
distinct()
approval_means_asc = approval_means[order(as.Date(approval_means$approval_date, format = "%m/%d/%Y")),]
disapproval_means_asc = disapproval_means[order(as.Date(approval_means$approval_date, format = "%m/%d/%Y")),]
head(tweets_fav)
head(approval_means_asc)
tail(disapproval_means_asc)
Match up starting dates.
tweets_ts <- tweets_fav[which(tweets_fav$tweets_date == "01/23/2017"):which(tweets_fav$tweets_date == "05/29/2020"),]
approve_ts <- approval_means_asc[which(approval_means_asc$approval_date == "01/23/2017"):which(approval_means_asc$approval_date == "05/29/2020"),]
disapprove_ts <- disapproval_means_asc[which(disapproval_means_asc$approval_date == "01/23/2017"):which(disapproval_means_asc$approval_date == "05/29/2020"),]
Finally, merge into one dataframe.
head(ts_complete)
head(ts_complete)
Create time series objects.
favourites_ts <- ts(ts_complete$mean_favs, start = 1, frequency = 1)
approval_ts <- ts(ts_complete$mean_app, start = 1, frequency = 1)
disapproval_ts <- ts(ts_complete$mean_disapp, start = 1, frequency = 1)
Plot and compare. First, approval ratings.
plot(approval_ts, type="l", col = "green", ylim = c(35,60))
lines(disapproval_ts, type="l", col="red")
Now, favourites against both (just the most recent 365 days).
mar.default <- c(5,4,4,2) + 0.1
par(mar = mar.default + c(0, 0, 0, 4))
plot(favourites_ts[844:1209], type="l", col="blue", ylab = "Favourites")
par(new = TRUE)
plot(approval_ts[844:1209], type = "l", axes = FALSE, bty = "n", col="green", xlab = "", ylab = "")
par(new = TRUE)
plot(disapproval_ts[844:1209], type = "l", axes = FALSE, bty = "n", col="red", xlab = "", ylab = "")
axis(side=4)
mtext("Approval percentage", side=4, line = 3)
legend("topleft", c("Favourites","Approval", "Disapproval"), lty = c(1,1), col=c("Blue", "Green", "Red"))
We see some possible correlation, but we need to go deeper.
Now we have our data nicely set up, it’s time to do some analysis. First, carry out a simple regression between favourites and approval. First, check if there is any correlation. First, plot.
plot(ts_complete$mean_favs, ts_complete$mean_app)
Seems like a pretty random spread - now check correlation.
cor(ts_complete$mean_app, ts_complete$mean_favs, method = c("pearson"))
[1] 0.1947603
Slight positive correlation - now check disapproval.
cor(ts_complete$mean_disapp, ts_complete$mean_favs, method = c("pearson"))
[1] -0.1408452
as.numeric(ts_complete$date)[1:5]
[1] 17189 17190 17191 17192 17193
Negative correlation, as exppected. Now we regress approval & disapproval on favourites. Inlcude date as a predictor, as we expect a time trend (note, we subtract 17188 from date so that it starts at 1).
app_lm <-lm(mean_app ~ mean_favs + as.numeric(I(date-17188)), data=ts_complete)
disapp_lm <-lm(mean_disapp ~ mean_favs + as.numeric(I(date-17188)), data=ts_complete)
summary(app_lm)
Call:
lm(formula = mean_app ~ mean_favs + as.numeric(I(date - 17188)),
data = ts_complete)
Residuals:
Min 1Q Median 3Q Max
-4.0401 -0.8551 -0.0090 0.7957 7.4324
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.503592015 0.138457263 285.313 < 0.0000000000000002 ***
mean_favs 0.000004331 0.000001390 3.115 0.00188 **
as.numeric(I(date - 17188)) 0.002394455 0.000133424 17.946 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.586 on 1206 degrees of freedom
Multiple R-squared: 0.2407, Adjusted R-squared: 0.2394
F-statistic: 191.2 on 2 and 1206 DF, p-value: < 0.00000000000000022
summary(disapp_lm)
Call:
lm(formula = mean_disapp ~ mean_favs + as.numeric(I(date - 17188)),
data = ts_complete)
Residuals:
Min 1Q Median 3Q Max
-12.9555 -0.7971 0.0671 0.9170 4.1050
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.268797601 0.167159784 324.652 < 0.0000000000000002 ***
mean_favs -0.000006925 0.000001679 -4.126 0.0000395 ***
as.numeric(I(date - 17188)) -0.000444903 0.000161083 -2.762 0.00583 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.914 on 1206 degrees of freedom
Multiple R-squared: 0.026, Adjusted R-squared: 0.02438
F-statistic: 16.1 on 2 and 1206 DF, p-value: 0.0000001263
We see that all covariates are statistically significant in both models.
Now we have shown a relationship between favourites and approval/disapprocal, we want to use ARIMA models to predict twitter favourites, and use this to in turn predict approval ratings.
First, check whether favourites are stationary.
kpss.test(favourites_ts)
p-value smaller than printed p-value
KPSS Test for Level Stationarity
data: favourites_ts
KPSS Level = 2.3025, Truncation lag parameter = 7, p-value = 0.01
Significant evidence that this is NOT stationary at 1% level. So, try auto.arima - we expect some differencing to be necessary to give stationarity.
auto.arima(favourites_ts, ic = "aic", seasonal = TRUE, trace = TRUE)
Fitting models using approximations to speed things up...
ARIMA(2,1,2) with drift : 28266.8
ARIMA(0,1,0) with drift : 28722.87
ARIMA(1,1,0) with drift : 28492.38
ARIMA(0,1,1) with drift : 28315.32
ARIMA(0,1,0) : 28720.88
ARIMA(1,1,2) with drift : 28257.47
ARIMA(0,1,2) with drift : 28286.1
ARIMA(1,1,1) with drift : 28267.67
ARIMA(1,1,3) with drift : 28257.93
ARIMA(0,1,3) with drift : 28275.17
ARIMA(2,1,1) with drift : 28265.45
ARIMA(2,1,3) with drift : 28269.16
ARIMA(1,1,2) : 28255.61
ARIMA(0,1,2) : 28284.1
ARIMA(1,1,1) : 28265.68
ARIMA(2,1,2) : 28264.95
ARIMA(1,1,3) : 28256.02
ARIMA(0,1,1) : 28313.32
ARIMA(0,1,3) : 28273.18
ARIMA(2,1,1) : 28263.63
ARIMA(2,1,3) : 28267.31
Now re-fitting the best model(s) without approximations...
ARIMA(1,1,2) : 28279.07
Best model: ARIMA(1,1,2)
Series: favourites_ts
ARIMA(1,1,2)
Coefficients:
ar1 ma1 ma2
0.6313 -1.3188 0.3471
s.e. 0.0850 0.0964 0.0874
sigma^2 estimated as 854910514: log likelihood=-14135.53
AIC=28279.07 AICc=28279.1 BIC=28299.45
So, we can use an ARIMA(1,1,2) model to forecast this TS. Use forecast package to do so.
arima_fav <- arima(favourites_ts, order = c(1,1,2), include.mean = T)
fc <- forecast(arima_fav, level = 95)
plot(fc, xlim = c(1200, 1220))
We can use this data, along with our linear model, to predict approval rating from tweet favourites.